REF: dont set ndarray.data in libreduction #34997

jbrockmendel · 2020-06-25T19:46:55Z

cc @WillAyd I'm still seeing 88 test failures locally and could use a fresh pair of eyes on this. Any thoughts?

pandas/_libs/reduction.pyx

jbrockmendel · 2020-07-01T22:14:36Z

Tentatively looks like the issue is that this patches _index_data without updating _data
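
A minimal sketch of the kind of mismatch described above, assuming an object that holds two references to the same values and a slider that re-points only one of them. `TinyIndex` and its accessor methods are hypothetical stand-ins for illustration, not pandas code:

```python
class TinyIndex:
    """Hypothetical stand-in that keeps two references to its values."""

    def __init__(self, values):
        self._data = values        # read by most methods
        self._index_data = values  # read by the lookup machinery

    def first_via_data(self):
        return self._data[0]

    def first_via_index_data(self):
        return self._index_data[0]


idx = TinyIndex([1.0, 2.0, 3.0])

# Re-point only one of the two attributes, as a per-group patch might do.
idx._index_data = [10.0, 20.0]

via_index_data = idx.first_via_index_data()  # sees the new values
via_data = idx.first_via_data()              # still sees the old values
```

Any code path that reads `_data` keeps seeing the first buffer, so the two views of the same object disagree.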

WillAyd · 2020-07-08T17:03:09Z

If it helps I've noticed that changing this line:

pandas/pandas/_libs/reduction.pyx

Line 357 in 42fd7e7

chunk = slider.dummy

to chunk = slider.frame[starts[i]:ends[i]].copy() takes the 88 failures down to about 20. That's not a real solution, since we don't want the copy, but I think this new design isn't fully compatible with some of the dummy work that the sliders do, so we might need to patch something together

Seems heading in the right direction though

jbrockmendel · 2020-07-08T17:41:36Z

closing to clear the queue

WillAyd

@jbrockmendel I made a few updates to get the failures down to a small handful. This may or may not be the right way of doing things, but hopefully it continues the conversation

pandas/_libs/reduction.pyx

WillAyd · 2020-07-22T20:32:35Z

Of the four remaining failures, one is a result of groupby.tshift, which we may deprecate in #34452

pandas/_libs/reduction.pyx

jbrockmendel · 2020-07-28T21:07:19Z

@WillAyd is there cause for optimism here? A solution here might have a bearing on one of the test failures in #35417

WillAyd · 2020-07-28T21:34:11Z

Well the optimism is that we are down to a handful of failures :-)

I think, though, that this is uncovering deep-seated issues in other parts of the code base. For instance, this is causing the test that #33439 added to fail, which right now I've traced back to a potential bug somewhere in our hashing code, but I think it's only happenstance that that test works in the first place.

Not sure if the other 4 failures are related to hashing as well but will hopefully figure out soon

WillAyd · 2020-07-29T17:09:18Z

I've noticed that, as an apply slides along the various groups, this line of code looks problematic with the current changes:

pandas/pandas/_libs/index.pyx

Line 231 in 3b1d4f1

values = self._get_index_values()

The problem is that the call to _get_index_values hits a cached property that, with this PR, doesn't seem to be updated and only ever references the index of the first group. So if you split a DataFrame with 3 elements into [0] and [1, 2] groupings, a hashtable only ever gets built for the [0] group. The next iteration then throws a KeyError when trying to figure out the location of [1, 2], and a whole slew of exception handling routes things off into space from there

@jbrockmendel any idea how the caching should be handled here? In the Python space the above call traces back to here:

pandas/pandas/core/indexes/base.py

Line 556 in 3b1d4f1

target_values = self._get_engine_target()
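
The stale-cache behaviour described in this comment can be sketched in plain Python. This is an illustration only: `ToyIndex`, `_mapping`, and the in-place `labels` swap are hypothetical stand-ins for the Index, its lazily built hashtable, and the slider's patching, and `functools.cached_property` stands in for pandas' own caching decorator:

```python
from functools import cached_property


class ToyIndex:
    """Hypothetical stand-in for an Index; not pandas code."""

    def __init__(self, labels):
        self.labels = list(labels)

    @cached_property
    def _mapping(self):
        # Built once on first access, like a lazily built hashtable.
        return {label: pos for pos, label in enumerate(self.labels)}

    def get_loc(self, label):
        return self._mapping[label]


dummy = ToyIndex([0])          # first group has labels [0]
first_loc = dummy.get_loc(0)   # builds and caches the mapping

dummy.labels = [1, 2]          # "slide" to the second group in place
try:
    dummy.get_loc(2)
    stale = False
except KeyError:
    stale = True               # cached mapping still describes [0]
```

Because the mapping was computed for the first group and never rebuilt, the lookup for the second group fails even though the underlying labels were updated.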

WillAyd · 2020-07-29T17:11:48Z

Here's the MRE I'm using to tackle the above comment.

import pandas as pd
df = pd.DataFrame({"A": ["S", "W", "W"], "B": [1.0, 1.0, 2.0]})
# For each group, look up the value at the group's last index label
res = df.groupby("A").agg({"B": lambda x: x.get(x.index[-1])})

On master this yields

     B
A     
S  1.0
W  2.0

But with this PR yields

     B
A     
S  1.0
W  NaN

This is because of the aforementioned KeyError when trying to locate elements by index in the second grouping

jbrockmendel · 2020-07-29T20:19:49Z

> The problem is that the call to _get_index_values hits a cached property

I'm not sure I follow. In index.pyx this is a call to self.vgetter(), which I think is supposed to return _index_data, which gets patched within libreduction. Am I misunderstanding?

cache_readonlys on the Index object definitely seem like a footgun. In the general case, the only way I can think of to avoid this is to create a fresh Index object, but avoiding that overhead is basically the whole point of the shenanigans in libreduction.
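
A rough sketch of the trade-off, using a hypothetical `ToyIndex` with `functools.cached_property` in place of pandas' cache_readonly. Option 1 is the fresh-object approach mentioned above. Option 2, explicitly clearing the cache on a reused object, is not something proposed in this thread and is shown only for comparison; the `reset_cache` helper is an assumption of this sketch:

```python
from functools import cached_property


class ToyIndex:
    """Hypothetical stand-in for an Index; not pandas code."""

    def __init__(self, labels):
        self.labels = list(labels)

    @cached_property
    def _mapping(self):
        return {label: pos for pos, label in enumerate(self.labels)}

    def get_loc(self, label):
        return self._mapping[label]

    def reset_cache(self):
        # cached_property stores its value in the instance __dict__,
        # so removing the entry forces a rebuild on next access.
        self.__dict__.pop("_mapping", None)


groups = [[0], [1, 2]]

# Option 1: a fresh object per group. Always consistent, but pays the
# construction cost that the reuse trick is meant to avoid.
fresh_locs = [ToyIndex(g).get_loc(g[-1]) for g in groups]

# Option 2: reuse one object, clearing the cache on every swap.
reused = ToyIndex(groups[0])
reused_locs = []
for g in groups:
    reused.labels = g
    reused.reset_cache()
    reused_locs.append(reused.get_loc(g[-1]))
```

Both options locate the last label of each group correctly. The second only stays correct if every cached attribute is cleared on every swap, which is where the footgun lies.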

WillAyd · 2020-07-29T20:27:58Z

Where is the call to _index_data? That would seem related but I wasn't seeing that; could be the missing link

jbrockmendel · 2020-07-29T23:37:42Z

> Where is the call to _index_data? That would seem related but I wasn't seeing that; could be the missing link

I expected to see it in Index._engine, but apparently not

WillAyd · 2020-08-04T16:14:31Z

/azp run

azure-pipelines · 2020-08-04T16:14:40Z

Azure Pipelines successfully started running 1 pipeline(s).

jbrockmendel · 2020-08-18T22:50:19Z

Just pushed after applying a patch from #35417. Locally that gets me to 3 test failures (which, now that I reread the thread, may not actually be an improvement)

jbrockmendel · 2020-09-17T21:00:45Z

Updated to implement NDFrame._can_use_libreduction to de-duplicate a bunch of similar (and inconsistently strict) checks done elsewhere. (Note an inline-comment I'll make about a place where this PR retains a different, inconsistent check)

This fixes an xfailed test, tests.groupby.test_apply.test_apply_with_timezones_aware

jbrockmendel · 2020-09-17T21:09:55Z

pandas/core/groupby/generic.py

-                #  see see test_groupby.test_basic
-                result = self._aggregate_named(func, *args, **kwargs)
+            if isinstance(
+                self._selected_obj.index, (DatetimeIndex, TimedeltaIndex, PeriodIndex)


This is the one place where I'm not using self._selected_obj._can_use_libreduction, as doing so would require 2.5 more kludges to get the tests passing:

1) below on 283, after ret = create_series_with_explicit_dtype, would need to do

    # Inference in the Series constructor may not infer
    # custom EA dtypes, so try here
    ret = maybe_cast_result(ret._values, obj, numeric_only=True)
    ret = Series(ret, index=index, name=obj.name)

1.5) in create_series_with_explicit_dtype, would need to change dtype_if_empty=object to dtype_if_empty=obj.dtype (which I'm not 100% sure about)

2) The kludge here would have to be amended from and name in output.index to and (name in output.index or 0 in output.index)

jbrockmendel · 2020-09-18T23:15:58Z

Closing.

This doesn't do quite what I thought it did, will need a new approach, xref #36459.

WillAyd · 2020-09-19T00:50:58Z

This thing is super thorny...nice effort in any case here. We will figure it out one of these days

WillAyd reviewed Jun 25, 2020

pandas/_libs/reduction.pyx

jbrockmendel mentioned this pull request Jul 7, 2020

Update setting data pointers for Cython 3 #34014

Closed

2 tasks

jbrockmendel closed this Jul 8, 2020

WillAyd reopened this Jul 22, 2020

WillAyd reviewed Jul 22, 2020

pandas/_libs/reduction.pyx

ivirshup reviewed Jul 23, 2020

pandas/_libs/reduction.pyx

jbrockmendel mentioned this pull request Aug 11, 2020

Fix cython3 #35675

Closed

5 tasks

jbrockmendel and others added 9 commits August 20, 2020 21:19

REF: remove unnecesary try/except

4c5eddd

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

c632c9f

…f-blockwise-3

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

9e64be3

…f-blockwise-3

TST: add test for agg on ordered categorical cols (pandas-dev#35630)

42649fb

TST: resample does not yield empty groups (pandas-dev#10603) (pandas-…

47121dd

…dev#35799)

revert accidental rebase

1decb3e

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

57c5dd3

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

a358463

…ster

Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…

ffa7ad7

…ster

mroeschke and others added 8 commits September 8, 2020 10:03

BUG: GroupbyRolling with an empty frame (pandas-dev#36208)

a56c6af

Co-authored-by: Matt Roeschke <[email protected]>

DOC: doc fix (pandas-dev#36205)

4a0152e

DOC: release date for 1.1.2 (pandas-dev#36182)

3aed293

Fixed pandas.json_normalize doctests errors` (pandas-dev#36207)

4c9add8

BUG: copying series into empty dataframe does not preserve dataframe …

11643bc

…index name (pandas-dev#36141)

CLN remove trailing commas (pandas-dev#36222)

edd802f

CLN: remove unused return value in _create_blocks (pandas-dev#36196)

9339b80

Make to_numeric default to correct precision (pandas-dev#36149)

070481c

jbrockmendel force-pushed the ref-libreduction-5 branch from 206a997 to 070481c Compare September 8, 2020 17:05

jbrockmendel mentioned this pull request Sep 12, 2020

BLD/CI: 3.9 support #36296

Closed

2 tasks

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

e8d42b0

…f-libreduction-5

jbrockmendel mentioned this pull request Sep 14, 2020

CI: Add stale PR action #36336

Merged

jbrockmendel added 8 commits September 14, 2020 13:19

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

c20f2cd

…f-libreduction-5

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

9220944

…f-libreduction-5

post-rebase fixup

4193e03

revert whitespace mixup

816f2fc

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

dfb3c10

…f-libreduction-5

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

1a318ef

…f-libreduction-5

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

4480c67

…f-libreduction-5

Implement _can_use_libreduction

865cb8b

jbrockmendel commented Sep 17, 2020

jbrockmendel marked this pull request as ready for review September 17, 2020 21:10

jbrockmendel added 2 commits September 18, 2020 14:06

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

a4d75da

…f-libreduction-5

Merge branch 'master' of https://github.com/pandas-dev/pandas into re…

7939eae

…f-libreduction-5

jbrockmendel closed this Sep 18, 2020

fangchenli mentioned this pull request Oct 9, 2020

BLD: remove blockslider #34014 #37006

Closed

jbrockmendel deleted the ref-libreduction-5 branch March 2, 2021 17:22

27 participants